Abstract

A new tightly coupled speech and natural language integration model is presented for a
TDNN-based continuous, possibly large-vocabulary, speech recognition system for Korean.
The popular n-best techniques were developed mainly for integrating HMM-based speech recognition
and natural language processing at the word level, which is inadequate for morphologically
complex agglutinative languages. In contrast, our model constructs a spoken language system based
on morpheme-level speech and language integration. With this integration scheme, the spoken
Korean processing engine (SKOPE) is designed and implemented using a TDNN-based diphone
recognition module integrated with Viterbi-based lexical decoding and symbolic
phonological/morphological co-analysis. Our experimental results show that speaker-dependent
continuous eojeol (Korean word) recognition and integrated morphological analysis can be achieved
with a success rate of over 80.6% directly from speech inputs for middle-sized vocabularies.

Keywords: speech and natural language integration, spoken language processing, 
morphological analysis, phonological modeling, Viterbi search, time-delayed neural 
networks 



1 Introduction 

A spoken natural language system requires many different levels of knowledge sources, including
acoustic-phonetic, phonological, morphological, syntactic, semantic, and even pragmatic levels.
These knowledge sources are grouped and processed by either speech processing models or
statistical/symbolic natural language processing models. Since the speech and the natural language
communities have conducted their research almost independently, these models have not been
completely integrated and are often biased by neglecting either acoustic-phonetic or high-level
linguistic information. Current speech and natural language integration mainly relies on word-level
n-best search techniques [?, ?], as shown in figure 1. For HMM-based speech recognition systems,
the n-best search techniques have been successfully applied to the integration of speech and
natural language processing. However, current implementations of n-best techniques only support
integration at the word level by directly producing the n-best list of candidate sentences, and
this type of loose coupling is
Figure 1: N-best lists: current speech and natural language integration method



only suitable for the integration of existing speech and natural language systems, e.g. [3, 4].
The n-best search is viable only for short sentences since the necessary n grows exponentially with
the sentence length. Because the n-best search directly generates word sequences, phonetic and
natural language dictionaries must have full word entries, which is inadequate for
morphologically complex agglutinative languages such as Korean. The dictionary size grows very
fast with full word entries because new words can be generated almost freely by concatenating the
constituent morphemes in these languages (e.g. noun plus noun-endings or verb plus verb-endings).
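
To make the growth argument concrete, the following minimal sketch counts dictionary entries under both schemes. The stem and ending names are hypothetical placeholders, and the sketch assumes that every stem can combine with every ending, which overstates real morphotactics.

```python
from itertools import product

def full_word_entries(stems, endings):
    """Enumerate every stem+ending form that a word-level dictionary would need."""
    return [stem + "+" + ending for stem, ending in product(stems, endings)]

def dictionary_sizes(stems, endings):
    """Return (word-level size, morpheme-level size) for the same coverage."""
    return len(full_word_entries(stems, endings)), len(stems) + len(endings)

# Hypothetical toy inventory: 4 stems and 3 endings (placeholder names).
stems = ["stem_a", "stem_b", "stem_c", "stem_d"]
endings = ["end_x", "end_y", "end_z"]
word_level, morpheme_level = dictionary_sizes(stems, endings)
print(word_level, morpheme_level)  # 12 7
```

Under this assumption the word-level dictionary grows with the product of the two inventories, while the morpheme-level dictionary grows only with their sum.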

In this paper, we present a new morphologically conditioned integration architecture of speech
and natural language processing for morphologically complex agglutinative languages. The
integration is based on Viterbi-based lexical decoding and symbolic phonological/morphological
co-analysis. The Viterbi search [?] is performed on diphone (explained in section 4) sequences
generated from a TDNN (time-delay neural network)-based Korean speech recognizer [?], and the
search process is tightly integrated with morphological and phonological constraint checking.
We present a new integration architecture, not for the popular HMM-based systems, but for the
recently developed connectionist speech recognition systems. Connectionist speech recognition has
several advantages over classical statistical speech processing [?]. In particular, the TDNN
model [?] has been widely used to model the time-shift invariance of speech signals. In this
regard, we present a morpheme-level integration method for a TDNN-based continuous speech
recognition model for Korean. This paper is organized as follows. Section 2 briefly explains the
characteristics of spoken Korean for general readers. Section 3 introduces our speech and natural
language integration architecture, and sections 4 and 5 elaborate on it.
Section 6 presents several experimental results to demonstrate the performance, and section 7
compares our integration scheme with related research. Section 8 draws some conclusions.



2 Features of spoken Korean 

This section briefly explains the linguistic characteristics of spoken Korean before describing the
integration architecture. In this paper, Yale romanization is used for representing the Korean
phonemes. 1) A Korean word, called an eojeol, consists of one or more morphemes with clear-cut
morpheme boundaries (Korean is an agglutinative language). 2) Korean is a postpositional
language with many kinds of noun-endings, verb-endings, and prefinal verb-endings. These functional
morphemes determine the noun's case roles, the verb's tenses and modals, and the modification
relations between eojeols. 3) Korean is basically an SOV language but has relatively free word
order compared to rigid word-order languages such as English, except for the constraint that the
verb must appear in sentence-final position. However, some word-order constraints do exist in
Korean: the auxiliary verbs representing modalities must follow the main verb, and modifiers must
be placed before the word (called the head) they modify. 4) The unit of pause in speech (which is
called an eonjeol) may be different from that of a written text (an eojeol). Spoken morphological
analysis must deal with an eonjeol (a fragment of a sentence) since no eojeol boundary is provided
in speech. 5) Phonological changes can occur within a morpheme, between morphemes in an eojeol,
and even between eojeols in an eonjeol. These changes include consonant and vowel assimilation,
dissimilation, insertion, deletion, and contraction.

3 SKOPE system architecture for morpheme-level integration 

The morpheme-level integration technique processes phoneme-like unit (PLU) sequences (the speech
recognizer's outputs) using both Viterbi-based lexical decoding (for morphemes) and symbolic
phonological/morphological co-analysis, and uses a single unified phonetic-morpheme (UPM)
dictionary for both speech and language processing. This morpheme-level integration scheme is able
to apply natural language morphological processing techniques at an earlier stage of spoken
language processing than the classical approaches of word-level speech and language integration.
The morpheme-level integration also makes phonological rule modeling possible at this early
stage. The phonological/morphological analysis can be performed together using the single UPM
dictionary, and the dictionary size remains stable regardless of the vocabulary size because only
the morphemes are encoded and new words can be processed by using the existing morphemes
in the dictionary.

Figure 2 shows the SKOPE architecture, a morpheme-level integration model of speech and
natural language processing for Korean. The speech signal is analyzed using the TDNN diphone
recognizer. The diphone recognizer is composed of a hierarchy of TDNN networks. The recognized
diphone sequences are decoded using the Viterbi search on the trie-structured UPM dictionary to
segment out the target morpheme candidates. In the UPM dictionary, each morpheme's phonetic
header is an HMM (hidden Markov model) network using the diphone symbols. The Viterbi-decoded
candidate morphemes are stored in a triangular table to be properly connected during the
morphological processing. From the candidate morphemes, the Viterbi-based morphological analyzer
produces the morphologically analyzed eojeols by handling morphotactics verification and irregular
conjugations. The phonological modeling is tightly integrated into the morphological processing
through declarative phonological rule modeling in the UPM dictionary. The outputs of the integrated
architecture, that is, analyzed eojeol sequences, can be directly fed to the upper-level syntactic
and semantic analysis modules, which are described in [?].

4 Diphone-based speech recognition 

Figure 2: SKOPE speech and language morpheme-level integration architecture. Syntax and
higher-level processing steps are not in the scope of this paper.

Figure 3: Korean diphone groups (V: vowel, C1: syllable-first consonant, C2: syllable-final
consonant). In the C2C1 type, the C2 must be one of the nasals or liquids/glides, which are similar
to vowels. Yale romanization is used to specify the diphone symbols.

For large-vocabulary continuous speech recognition, sub-word level recognition is usually
performed. We select a group of diphones as our phoneme-like units (PLUs) because direct phoneme
recognition in Korean is very difficult. The 46 Korean phonemes are very similar to each other,
especially in the following cases: 1) the Korean diphthongs are hard to distinguish from the
mono-vowels,
and 2) the syllable-final consonants are hard to differentiate from the syllable-first consonants.
The selected diphone groups (figure 3) have more suitable features for co-articulation modeling
than the phonemes and are much fewer in number than the popular triphones [?]. We also introduced
CC-type (syllable-final consonant, syllable-first consonant) diphones for smooth transition
modeling between syllables in Korean. Figure 4 shows the hierarchical structure of the group of
TDNNs for diphone recognition, and also shows the architecture of each component TDNN. The whole
diphone recognizer consists of a total of 19 different TDNNs for recognition of the defined Korean
diphones. We re-classified all the diphones into 18 different groups according to the vowel
characteristics of the diphones. The top-level TDNN (vowel group TDNN) identifies the 18 vowel
groups of the diphones using relatively low-frequency signal vectors (under 4 kHz). Each of the 18
sub-group TDNNs recognizes the target diphones using the whole-frequency signal vectors. For the
training of each TDNN, we manually segmented the digitized speech signals into 200 msec ranges
(each of which roughly includes the left-context phoneme, the target diphone, and the right-context
phoneme), and applied 512-order FFTs and 16-step mel-scaling [?] to get the filter-bank
coefficients. Each frame size is 10 msec, so 20 (frames) by 16 (mel-scaling factor) values are fed
to the TDNNs with the proper output symbols, that is, the vowel group name or the target diphone
name. After the training of each TDNN, the diphone recognition is performed by feeding 200 msec
signals to the vowel group TDNN and subsequently to the proper sub-group TDNN according to the
extracted vowel group. The 200 msec signals are shifted by 30 msec steps and continuously fed to
the networks to process the
Figure 4: Top: hierarchical organization of the group of TDNNs for entire diphone recognition.
Bottom left: TDNN architecture for vowel group identification. Note the cc group contains no
vowels. Bottom right: Architecture of the sub-TDNN for /a/ vowel group recognition. The other
17 sub-TDNNs have the same architecture, but a different number of output units according to the
number of diphones in each vowel group.

Figure 5: The unified phonetic-morpheme (UPM) dictionary for entries ci-wu (delete), l
(adnominalizing verb-ending), and swu (bound-noun).



continuous speech in an eonjeol (pause unit of Korean speech). The final output is a sequence of
diphones, one for each 200 msec range at 30 msec intervals. The hierarchical TDNN structure
shortens the training time and makes the system design easily extensible. The overall recognition
rate depends critically on the vowel group TDNN in this hierarchical structure.
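
The windowing and two-stage dispatch described above can be sketched as follows. This is an illustrative outline, not the SKOPE implementation: the function names, the callables standing in for trained TDNNs, and the symbol `ka` are assumptions introduced here. Only the 200 msec window, the 30 msec shift, the 10 msec frame, and the 16 coefficients come from the text.

```python
FRAME_MS = 10      # frame size stated in the text
WINDOW_MS = 200    # analysis range fed to each TDNN
SHIFT_MS = 30      # shift between successive windows
N_MEL = 16         # mel filter-bank coefficients per frame

WINDOW_FRAMES = WINDOW_MS // FRAME_MS   # 20 frames per window
SHIFT_FRAMES = SHIFT_MS // FRAME_MS     # 3 frames per shift

def window_starts(n_frames):
    """Start indices (in frames) of every full window over an utterance."""
    return list(range(0, n_frames - WINDOW_FRAMES + 1, SHIFT_FRAMES))

def spot_diphones(frames, vowel_group_net, subgroup_nets):
    """Two-stage recognition: pick a vowel group, then ask that group's network."""
    outputs = []
    for start in window_starts(len(frames)):
        window = frames[start:start + WINDOW_FRAMES]   # 20 x 16 values
        group = vowel_group_net(window)                # stage 1: vowel group
        outputs.append(subgroup_nets[group](window))   # stage 2: diphone
    return outputs

# Stand-in networks (hypothetical): always answer group "a" and diphone "ka".
frames = [[0.0] * N_MEL for _ in range(50)]            # 500 msec of dummy input
labels = spot_diphones(frames, lambda w: "a", {"a": lambda w: "ka"})
print(len(labels))  # 11 windows: starts 0, 3, ..., 30
```

Because the second stage is selected by the first, an error in the vowel group decision cannot be repaired by the sub-group network, which matches the remark that the overall rate depends critically on the vowel group TDNN.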



5 Viterbi-based morphological analysis 

Unlike conventional morphological analyses for text inputs, our morphological analysis starts with
the recognized diphone sequences, which contain insertion, deletion, and substitution errors from
speech recognition. The conventional morphological analysis procedure [?], i.e., morpheme
segmentation, morphotactics modeling, and orthographic rule (or phonological rule) modeling, must
be augmented and extended to cope with the recognition errors as follows: 1) the conventional
morpheme segmentation is extended to deal with speech recognition errors and between-morpheme
phonological changes as well as irregular conjugations during the segmentation, 2) the
morphotactics modeling is extended to cope with the complex verb-endings and noun-endings in
Korean, and 3) the orthographic rule modeling is combined with the phonological rule modeling to
correctly transform the diphone transcriptions (phonetic spelling) into orthographically spelled
morpheme sequences.

The central part of the morphological analysis lies in the dictionary construction. In our UPM
(unified phonetic-morpheme) dictionary, each phonetic transcription of a single morpheme has a
separate dictionary entry. Figure 5 shows the UPM dictionary for both speech and language
processing with three different morpheme entries: ci-wu, l, and swu. The extended morphological
analysis is based



on the well-known tabular parsing technique for context-free languages [12] and is augmented to
handle the Korean phonological rules and speech recognition errors in the diphone sequence inputs.
Figure 6 shows the extended table-driven morphological analysis process. The example diphone
sequence was obtained from the input speech ci-wul-sswu (meaning: can/cannot be removed), and the
morphological analysis produces ci-wu+l+swu (remove+ADNOMINAL+BOUND-NOUN), where
'+' is the morpheme boundary, and '-' is the syllable boundary. The morpheme segmentation is
basically performed using the Viterbi-based lexical decoding to recover from possible errors in the
diphone sequences. For the Viterbi search, the phonetic transcription headers for each morpheme in
the UPM dictionary are converted into diphone transcription headers, and each converted header
Figure 6: Morphological analysis of diphone sequences. From top: output morpheme sequence in
an eonjeol, triangular parsing table, and input diphone sequence.



is turned into a simple HMM. The converted HMMs are organized into a trie data structure for
efficient search (see figure 7), and form a trie-structured diphone-based HMM index. The HMMs
are of the simplest kind, with only left-to-right and self transitions. Additional diphone nodes
(marked with thick circles) are inserted for smooth inter-morpheme co-articulation modeling. The
transition probability in each HMM is defined as:

\[
a_{ij} =
\begin{cases}
\alpha & \text{if } i = j \\
\dfrac{1-\alpha}{N} & \text{if } i \neq j \wedge d_t = s_i \wedge d_{t+1} = s_j \\
0 & \text{otherwise}
\end{cases}
\]

where $a_{ij}$ is the transition probability from state $i$ to state $j$, $N$ is the number of all
possible transitions from state $i$, $d_t$ is the diphone observed at time $t$, and $s_i$ is the
diphone at state $i$. This model assigns the self-transition probability $\alpha$ and the
left-to-right transition probability $(1-\alpha)/N$. All other transition probabilities are zero.
In each state, the diphone emission probabilities are defined as:



\[
b_i(k) =
\begin{cases}
\beta & \text{if } d_k = s_i \\
\dfrac{1-\beta}{M-1} & \text{otherwise}
\end{cases}
\]



where $b_i(k)$ is the probability of producing diphone $d_k$ at state $i$, and $M$ is the number of
all the diphones in the model. We adjust $\alpha$ and $\beta$ experimentally, and this flexible
adjustment helps to cope with the insertion and deletion errors in the diphone sequences. The
Viterbi search with the trie-structured HMM index on the input diphone sequence segments out all
the possible morphemes in the given diphone sequence, and enrolls all the segmented morphemes into
the triangular table at the proper positions. For example, in figure 6, morphemes such as ci
(carry), ci-wu (delete),
Figure 7: Trie-structured diphone-based HMM index for morphemes ciwu, l, swu, iss, nun. In
each node, if a path from the root (start node) completes a morpheme, a pointer leads to the
corresponding morpheme entry in the UPM dictionary. The self-transition for each node except the
root is left out to keep the figure simple.



l (adnominal verb-ending), wul (cry), and swu (bound-noun) are segmented out and enrolled in the
table positions (1,2), (1,3), (4,4), (3,4), and (5,6), respectively. The position (i,j) designates
the starting and ending positions of each morpheme in the given input eonjeol.
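
The segmentation step can be illustrated with a small Viterbi sketch over left-to-right HMMs. Several points here are assumptions rather than details given in the text: the values of alpha and beta, the use of (1 - beta)/(M - 1) for a mismatched emission, N = 1 for a linear chain of states, the average-log-score threshold used to decide which candidates are enrolled, and the diphone symbols themselves. A flat loop over the lexicon stands in for the trie index, and the chain has no skip transitions, so it tolerates repeated or substituted diphones but not deleted ones.

```python
import math

ALPHA = 0.3   # self-transition probability (assumed value)
BETA = 0.9    # probability of emitting the state's own diphone (assumed value)

def viterbi_log_score(states, observed, n_symbols, alpha=ALPHA, beta=BETA):
    """Best-path log probability of `observed` through a left-to-right chain.

    The path must start in the first state and end in the last one.
    """
    log_self = math.log(alpha)
    log_next = math.log(1.0 - alpha)                     # N = 1 in a linear chain
    log_hit = math.log(beta)
    log_miss = math.log((1.0 - beta) / (n_symbols - 1))  # assumed normalization
    neg_inf = float("-inf")

    def emit(state, symbol):
        return log_hit if state == symbol else log_miss

    score = [neg_inf] * len(states)
    score[0] = emit(states[0], observed[0])
    for symbol in observed[1:]:
        new = [neg_inf] * len(states)
        for j, state in enumerate(states):
            stay = score[j] + log_self
            move = score[j - 1] + log_next if j > 0 else neg_inf
            best = max(stay, move)
            if best > neg_inf:
                new[j] = best + emit(state, symbol)
        score = new
    return score[-1]

def segment(observed, lexicon, n_symbols, min_avg_log=-1.5):
    """Enroll every morpheme whose average log score over a span passes the threshold.

    Returns {(start, end): [morpheme, ...]} with 1-based inclusive positions,
    as in the triangular table.
    """
    table = {}
    for i in range(len(observed)):
        for j in range(i, len(observed)):
            span = observed[i:j + 1]
            for morpheme, states in lexicon.items():
                score = viterbi_log_score(states, span, n_symbols)
                if score / len(span) >= min_avg_log:
                    table.setdefault((i + 1, j + 1), []).append(morpheme)
    return table

# Hypothetical diphone headers for two morphemes and a clean input sequence.
lexicon = {"ci-wu": ["ci", "iw", "wu"], "swu": ["sw", "wu"]}
print(segment(["ci", "iw", "wu", "sw", "wu"], lexicon, n_symbols=10))
```

With these toy settings the table receives ci-wu at (1,3) and swu at (4,5). Raising alpha makes repeated diphones cheaper, and lowering beta makes substitutions cheaper, which is the kind of adjustment the text describes.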

Morphotactics modeling is necessary after all the morphemes are enrolled in the table in
order to combine only legal morphemes into an eojeol (Korean word); this process is called
morpheme-connectivity-checking. Since Korean has well-developed postpositions (noun-endings,
verb-endings, prefinal verb-endings) which act as grammatical functional morphemes, we must
assign each morpheme proper part-of-speech (POS) tags for efficient connectivity checking.
Our more than 400 POS tags, which are refined from the 13 major Korean lexical categories, are
hierarchically organized and contained in the UPM dictionary (under the name of left and right
morphological connectivity, see figure 5). In the case of idiomatic expressions, we place such
idioms directly in the dictionary for efficiency, where two different POS tags are necessary for
the left and the right morphological connectivity. For a single morpheme, the left and the right
POS tags are always the same. The separate morpheme-connectivity-matrix (sometimes called the
morpheme-adjacency-matrix) indicates the legal morpheme combinations using the POS tags defined in
the dictionary. So the morphotactics modeling is performed by utilizing two essential components:
the POS tags (in the dictionary) and the morpheme-connectivity-matrix. For example, in figure 6,
the morpheme ciwu (in position (1,3)) can be legally combined with the morpheme l (in position
(4,4)) to make ciwu+l (delete+ADNOMINAL, in position (1,4)), but ci cannot be combined with wul to
make ci+wul even though they are in combinable positions.
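
The connectivity check over the triangular table can be sketched as a CYK-style combination of adjacent cells. The three tag names and the contents of the connectivity set below are hypothetical stand-ins for the 400-plus POS tags and the morpheme-connectivity-matrix; the table positions follow the example in the text.

```python
def combine(table, n, connectable):
    """Fill the triangular table with all legal morpheme concatenations.

    table: {(start, end): [(form, left_tag, right_tag), ...]}, 1-based inclusive.
    connectable: set of (right_tag_of_left_part, left_tag_of_right_part) pairs.
    """
    for length in range(2, n + 1):
        for start in range(1, n - length + 2):
            end = start + length - 1
            cell = table.setdefault((start, end), [])
            for split in range(start, end):
                for lform, lleft, lright in table.get((start, split), []):
                    for rform, rleft, rright in table.get((split + 1, end), []):
                        if (lright, rleft) in connectable:
                            entry = (lform + "+" + rform, lleft, rright)
                            if entry not in cell:
                                cell.append(entry)
    return table

# Candidate morphemes at the positions given in the text; tags are hypothetical.
table = {
    (1, 2): [("ci", "VERB_STEM", "VERB_STEM")],
    (1, 3): [("ci-wu", "VERB_STEM", "VERB_STEM")],
    (4, 4): [("l", "ADNOMINAL_ENDING", "ADNOMINAL_ENDING")],
    (3, 4): [("wul", "VERB_STEM", "VERB_STEM")],
    (5, 6): [("swu", "BOUND_NOUN", "BOUND_NOUN")],
}
connectable = {
    ("VERB_STEM", "ADNOMINAL_ENDING"),
    ("ADNOMINAL_ENDING", "BOUND_NOUN"),
}
combine(table, 6, connectable)
print(table[(1, 6)])  # [('ci-wu+l+swu', 'VERB_STEM', 'BOUND_NOUN')]
```

In this toy setting ci+wul is rejected because a verb stem followed by a verb stem is not in the connectivity set, even though positions (1,2) and (3,4) are adjacent.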

The orthographic rule modeling must be integrated with the phonological rule modeling in
spoken Korean processing. Since we must deal with erroneous speech inputs, conventional
rule-based modeling requires a very large number of rule applications [13]. So our solution is
based on the declarative modeling of both orthographic and phonological rules in a uniform way.
That is, in our UPM dictionary, the conjugated verb forms as well as the original verb forms are
all enrolled, and the same morphological connectivity information is applied to both the original
verb forms and the conjugated ones. The phonological rule modeling is also accomplished
declaratively by having separate phonemic connectivity information in the dictionary (see figure
5). The phonemic connectivity information for each morpheme declares the possible phonemic changes
in the first (left) and the last (right) phonemes of the morpheme, and the
phoneme-connectivity-matrix indicates the legal sound combinations in Korean phonology using the
defined phonemic connectivity information. For example, in figure 5, the morpheme l can be combined
with the morpheme swu during the morpheme connectivity checking even if swu is actually pronounced
as sswu (see the input in figure 6). The phoneme-connectivity-matrix supports the legality of the
combination of the l sound with the s sound changed to ss. This legality comes from the Korean
phonological rule of glottalization (one form of consonant dissimilation), which states that the s
sound becomes the ss sound after the l sound. In this way, we can declaratively model all the major
Korean phonological rules such as (syllable-final consonant) standardization, consonant
assimilation, palatalization, glottalization, insertion, deletion, and contraction.
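
The declarative phonological check can be sketched in the same style. The entry layout and the single matrix entry below are hypothetical and only mirror the l plus swu example; the actual UPM dictionary fields are shown in figure 5 and are not reproduced here.

```python
# Hypothetical declarative entries: each morpheme lists the surface sounds
# that its first and last phonemes may take.
ENTRIES = {
    "l":   {"first": {"l"}, "last": {"l"}},
    "swu": {"first": {"s", "ss"}, "last": {"u"}},
}

# Hypothetical phoneme-connectivity matrix: legal (left-final, right-initial) pairs.
PHONEME_CONNECTIVITY = {("l", "ss")}

def legal_junctions(left, right, entries=ENTRIES, matrix=PHONEME_CONNECTIVITY):
    """Return the surface sound pairs under which `left` may precede `right`."""
    return sorted(
        (a, b)
        for a in entries[left]["last"]
        for b in entries[right]["first"]
        if (a, b) in matrix
    )

print(legal_junctions("l", "swu"))  # [('l', 'ss')]
```

No rewrite rule is applied at run time: the glottalized variant is simply declared as a possible first sound of swu, and the matrix states that it is legal after l.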



6 Implementation and experimental results

The SKOPE speech and natural language integration architecture was implemented in standard
C with an X Window user interface on a UNIX/Sun Sparc platform. The system's inputs are
carefully articulated Korean speech recorded in a normal laboratory environment, and the outputs
are morphologically analyzed eojeol sequences which can be directly used by Korean syntactic and
semantic analysis modules. We constructed a 1000-morpheme-entry UPM dictionary in a UNIX
operating system domain [14], and built morpheme connectivity and phoneme connectivity matrices
for the phonological/morphological co-analysis. The UPM dictionary is indexed using the
diphone-transcribed HMM headers for each morpheme, which are organized into a trie. Since there is
no standard segmented Korean speech database yet, we constructed our own by recording and
manually segmenting the 73 most frequent Korean diphones. The 73 diphones are acquired from the
300 Korean eojeols (each eojeol is pronounced 15 times by a female speaker) in 50 Korean sentences
which can appear in natural language commands to the UNIX operating system [14].



Several experiments were performed to verify the system's performance in time-shift invariance,
diphone recognition, and final eojeol recognition including the spoken language morphological
analysis. In each experiment, the input speech patterns were prepared as follows: eojeols were
recorded in a normal laboratory environment with an average S/N ratio of 12 dB. Speech data were
sampled at 16 kHz with 16-bit resolution, and Hamming-windowed. From this windowed data, 512-point
DTFTs were computed at 5 msec intervals. The DTFTs were used to generate 16 mel-scale filter-bank
coefficients at a 10 msec frame size [?]. These spectra were normalized to produce suitable input
levels for the four-layer TDNNs. We used the hyperbolic arc tangent error function for the weight
updating [?] in the back-propagation training, and updated the weights after a small number of
iterations [?].
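
The filter-bank step can be illustrated with a small pure-Python sketch. The text gives the sampling rate, the 512-point transform, and the 16 filters, but not the mel formula or the filter shape, so the common 2595 log10(1 + f/700) mapping and triangular filters used below are assumptions, as are the function names.

```python
import math

def hz_to_mel(f):
    """Common mel mapping (assumed; the text does not name a formula)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=16, n_fft=512, sample_rate=16000):
    """Triangular filters over the n_fft // 2 + 1 bins of one spectrum."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 equally spaced mel points give the edges of the triangles.
    edges_hz = [mel_to_hz(top * k / (n_filters + 1)) for k in range(n_filters + 2)]
    edges = [hz * n_fft / sample_rate for hz in edges_hz]   # fractional bin index
    bank = []
    for k in range(1, n_filters + 1):
        lo, mid, hi = edges[k - 1], edges[k], edges[k + 1]
        row = []
        for b in range(n_bins):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))    # rising slope
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))    # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def apply_filterbank(power_spectrum, bank):
    """Collapse one 257-bin power spectrum into one value per mel filter."""
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]

bank = mel_filterbank()
coeffs = apply_filterbank([1.0] * 257, bank)   # flat dummy spectrum
print(len(coeffs))  # 16
```

Applying the bank to each 10 msec frame of a 200 msec range gives the 20 by 16 input described in section 4.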



6.1 Time-shift invariance of Korean diphones 

We generated 2400 diphone samples for 12 typical Korean diphones. The input patterns for the two
test cases are the same in order to compare the no-time-shift and time-shift cases. Figure 8
shows that the Korean diphone recognition maintains the time-shift invariance property of the TDNN
and suggests that the optimal time interval is near 200 - 250 msec.



6.2 Comparison of diphone recognition vs. phoneme recognition 

This experiment shows that diphones can improve the recognition rate of Korean vowels, despite the
many rising diphthongs, compared with phoneme recognition. In the test, we set
Figure 9: Diphone recognition versus phoneme recognition test



a 150 msec time range for the phoneme segmentation and 200 msec for the diphone segmentation.
Compared with the phoneme recognition, figure 9 shows that diphone recognition performance does not
drop much when the number of targets with similar features doubles.



6.3 Performance of continuous diphone recognition 

In this experiment, we pronounced 66 carefully chosen eojeols 15 times each to generate about 5500
diphone patterns for training. The 5500 training samples are used to train the vowel group TDNN
and the 18 different sub-TDNNs, one for each diphone group. For recognition, 262 new eojeols
are selected to generate 2432 test eojeol patterns, and these test patterns are shifted in 30
msec steps during recognition to obtain the TDNN diphone spotting performance in continuous



speech. Figure 10-a shows the continuous diphone spotting performance. We have a total of 7772
target diphones from the 2432 test eojeol patterns. Correct designates that the correct target
diphones were spotted in the testing position, and delete designates the other case (including
substitution errors). Insert designates that non-target diphones were spotted in the
testing position. To compare the ability to handle continuous speech, we also tested
diphone spotting using hand-segmented test patterns with the same 7772 target diphones.
Figure 10-b shows the segmented diphone recognition performance. Since the test data are already



hand-segmented before input, there are no insertion and deletion errors in this case. The fact that 
Figure 10: Continuous diphone spotting versus segmented diphone recognition



the segmented speech performance is not much better than the continuous one (93.8% vs. 93.3%)
demonstrates the diphone's suitability for handling continuous Korean speech.
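
The three measures can be computed along the following lines. This is one possible reading of the definitions above: it assumes that the insert figure counts testing positions at which at least one non-target diphone was spotted, which the text does not state explicitly. The toy data are hypothetical and are not results from the experiments.

```python
def spotting_rates(targets, spotted):
    """Return (correct, delete, insert) percentages relative to the target count.

    targets: list of target diphones, one per testing position.
    spotted: list of sets, the diphones spotted at each testing position.
    """
    total = len(targets)
    correct = sum(1 for t, s in zip(targets, spotted) if t in s)
    insert = sum(1 for t, s in zip(targets, spotted) if s - {t})
    delete = total - correct          # everything that is not correct
    return tuple(100.0 * x / total for x in (correct, delete, insert))

# Hypothetical toy data: four testing positions.
targets = ["ci", "iw", "wu", "sw"]
spotted = [{"ci"}, {"iw", "ix"}, set(), {"sw"}]
print(spotting_rates(targets, spotted))  # (75.0, 25.0, 25.0)
```

Under this reading correct and delete always sum to 100%, while insert is counted independently, so a position can be both correct and an insertion.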

6.4 Performance of continuous speech morphological analysis 

In order to test the ability of full eojeol recognition including the Viterbi-based lexical
decoding and phonological/morphological co-analysis, a middle-vocabulary experiment was carried
out. The task is speaker-dependent continuous eojeol recognition, which produces the
morphologically analyzed eojeol sequences directly from the speech inputs. In the process, the
speech recognizer produces erroneous diphone sequences for the input eonjeols, and then the Viterbi
morphological analyzer segments them with error correction and produces the final analyzed eojeols.
So, in this task, all the intermediate steps, that is, diphone spotting, lexical decoding, and
morphological/phonological analysis, are combined to produce the final recognition performance.
The same 328 eojeols as in section 6.3 were fed to the SKOPE integration architecture with the
pre-trained TDNNs (trained with 66 eojeols). Figure 11-a shows the performance with the 66 trained
eojeols and figure 11-b shows the final performance on all 328 eojeols. We have a total of 4266
target morphemes from the same 328 eojeols used in section 6.3. In the figure, correct designates
that the correct morpheme sequences can be analyzed from the speech input, and delete means that
the correct morpheme sequences cannot be generated (including substitution errors). Insert
designates the percentage of spurious morphemes that are generated from the insertion errors.
The performance is about 80.6% correctness in the final morphological analysis with mostly
untrained new data, which is quite promising considering the complexity of the task.



7 Comparison with related research

Recently, the idea of sending only the n best speech recognition results to a natural language
module has been implemented using the time-synchronous Viterbi-style search algorithm [?]. The
algorithm was also improved by the word-dependent search [?] and by adding the A* backward tree
search [17]. The n-best integration scheme has been mostly utilized for HMM-based continuous speech
recognition systems, and many existing speech systems and natural language systems have been
successfully integrated using the n-best word search techniques [3, 4]. However, until now, the
n-best search techniques have only been implemented to directly produce the n-best sentences using
word sequences, and this word-level integration is inefficient for morphologically complex
languages
Figure 11: Continuous speech input morphological analysis performance



such as Korean. In contrast, our integration is at the morpheme level, directly decoding the
PLU sequences together with the morphological processing, because we need more sophisticated
phonological/morphological handling in the early stage of the integration process. The word-level
n-best integration also assumes a word-level dictionary, which is an unreasonable assumption for
morphologically complex languages. According to the recent classification by Harper and others
[18], n-best integration is a typical loosely-coupled example.

The HMM-LR integration [19, 20] was implemented using the HMM's phoneme spotting ability
integrated with the generalized LR parsing techniques [21]. Unlike the n-best integration, the
HMM-LR integration is tighter and is implemented at the phoneme level by extending the LR
parser's terminal symbols to cover the phonetic transcriptions. In this scheme, the LR parsing
selects the most probable parsing results by obtaining the probability of the end-point candidate
phonemes from the HMM's forward probability calculation. So the integrated system
works by the LR parser's prediction of the next phoneme candidates, which are then verified
by the HMM's phoneme spotting ability. The idea of extending the LR grammar to the phonetic
transcriptions seems to work for phoneme-level integration. However, the scheme
does not have any separate language-level dictionary, which results in degenerate
phonological/morphological processing, and it is also difficult to scale up. In
contrast, our SKOPE integration architecture focuses on general phonological/morphological
handling during the integration, which is essential for agglutinative languages. The idea of
extending the LR grammar to the phonetic transcriptions was also applied to the TDNN-LR integration
method [22, 23], which was similarly implemented by replacing the HMM's phoneme spotting with the
TDNN's phoneme spotting. The integration was implemented by a dynamic time warping (DTW)
level-building search [24] between the TDNN's phoneme sequences and the LR grammar's phoneme
sequences. However, the performance was relatively poor compared with the HMM-LR integration
method [22]. There are basically two reasons for the poor TDNN-LR performance compared with
the HMM-LR integration: 1) the TDNN model has rarely been applied to practical large-vocabulary
systems, and therefore lacks the fine tuning of the popular HMM models,
and 2) the TDNN model has yet to find the right way to be effectively integrated into a natural
language processing model. The HMM model supports natural integration into general chart-based
parsing models such as generalized LR parsing because there are well-defined probabilistic
search techniques at the language level as well as at the speech level. However, the output
activations of multiple TDNNs are difficult to normalize and are therefore difficult to integrate
naturally into popular probabilistic search schemes such as Viterbi search. Our SKOPE architecture
adopts Viterbi search with pre-defined transition and emission probabilities, and uses the Viterbi
search only for segmenting erroneous diphone strings. All the other morphological processing steps
are generally performed according to the symbolic natural language processing model.

More tightly-coupled systems have also been researched to integrate all the knowledge
sources of spoken language processing, from acoustic to semantic, into a single interdependent
model that cannot easily be separated. In these systems, the syntactic parser directly deals with
acoustic-level inputs. For example, Ney [25] extended the CYK parsing algorithm to cover acoustic
inputs by exhaustively finding all possible endpoints for every terminal symbol. In a similar vein,
the HMM can be extended to handle recursive embedding for context-free grammar processing [26].
However, these acoustic-level syntactic parsers are computationally expensive since the parsing
complexity is at best $O(n^3)$, where $n$ could be several hundred when the parsers directly deal
with the speech frames. The SKOPE integration is tighter than the loosely-coupled n-best
techniques, but less tight than these tightly-coupled systems. We agree that high-level linguistic
constraints should restrict the underlying speech recognition in some way, as in the
tightly-coupled systems, but disagree that the constraints should be at the syntactic level. The
more tightly-coupled systems are often impractical for large-scale spoken language processing
because of their time complexity. Moreover, we still do not know how much top-down feedback
actually helps to improve the speech recognition process. From an engineering point of view,
semi-tightly-coupled systems are quite feasible for large complex systems under the current
technology. In this regard, the SKOPE project adopts a semi-tightly-coupled integration technique
between speech and language processing, especially morphological processing.



8 Conclusions 

This paper presents a morpheme-level integration architecture of speech and natural language in
a connectionist continuous speech recognition model for agglutinative languages such as Korean.
Our main contribution is the morphologically conditioned semi-tight integration model,
which can support sophisticated phonological/morphological processing in the integration of speech
and language, which is essential for morphologically complex agglutinative languages. Also, the
SKOPE integration architecture is a first attempt to develop a morphologically general integration
model using the connectionist speech recognition paradigm.

The SKOPE speech and language integration architecture has several novel features for speech
and natural language processing. First, the diphone-based TDNN provides a suitable sub-word unit
of recognition that reflects the Korean phonetic characteristics well. Secondly, the morphological
analysis combined with the declarative phonological rule modeling is well suited to mapping
phonetic spellings into orthographic morphemes, which is an essential task for every spoken
language processing model. Finally, the trie-structured HMM indexing for the UPM dictionary enables
the Viterbi-style search to be applied to the thorny morpheme segmentation and lexical decoding
problem, and also provides natural integration of symbolic natural language processing techniques
with probabilistic decoding schemes. The experiments show that the final morphological analysis
performance from continuous speech is over 80.6% in a middle-vocabulary speaker-dependent
recognition task, which is very promising considering the continuous speech and the combination of
several processing steps such as diphone spotting, lexical decoding, and
morphological/phonological analysis. Since the integration architecture is based on the general
linguistic notions of phoneme and morpheme, the architecture is not restricted to Korean. The SKOPE
architecture can be extended to any agglutinative language which has clear-cut morphological
boundaries, such as Japanese, and possibly to Indo-European languages which exhibit well-developed
morphological phenomena, such as German. We are now extending the integration technique to
Japanese.